3 Steps To Become A Data Scientist
Introduction
So you want to be a Data Scientist and don’t know where to start? Well, you’ve come to the right place.
Today you’ll learn a little about the economy, a whole bunch of data science principles and practices, and where to stay on your next trip to New York City!
Data Science Pipeline
First, let’s look at the data science pipeline.
The data science pipeline starts with defining what major questions one wants to answer and subsequently acquiring and importing the relevant data to be analyzed.
Next, the data is viewed and tidied. Tidying assumes a rectangular data structure model that meets three requirements: each observation (called an entity) forms a row, each variable (called an attribute) forms a column, and each observational unit (type of entity) forms a table.
This leads to the exploratory data analysis process, where the data is transformed and visualized. Data cleaning may be necessary when data is missing. Missing values can be removed, encoded, or, for a numeric variable, imputed (replaced with the mean of the non-missing values).
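To make the three options for missing data concrete, here is a minimal sketch in base R. The `listings` data frame and its values are made up for illustration and are not part of the Airbnb dataset.

```r
# Toy data frame (made up for illustration) with one missing price
listings <- data.frame(
  id    = 1:4,
  price = c(100, NA, 80, 120)
)

# Option 1: remove rows that contain a missing value
removed <- listings[!is.na(listings$price), ]

# Option 2: encode missingness in its own indicator column
encoded <- listings
encoded$price_missing <- is.na(encoded$price)

# Option 3: impute with the mean of the non-missing values
imputed <- listings
imputed$price[is.na(imputed$price)] <- mean(listings$price, na.rm = TRUE)

print(imputed)
```

Which option fits best depends on why the values are missing, so treat this as a starting point rather than a rule.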
Hypothesis testing and machine learning (ML) modeling are the final steps before the data and its results can be communicated.
Image of Pipeline
Economy & Vacations
The rise of companies like Airbnb has created the sharing economy, a model defined as the facilitation of goods and services on a peer-to-peer level, usually through online community platforms. This new model has made it possible for many people to gain another source of income and for you to have an affordable vacation.
As more sharing economy companies have opened, like Airbnb and Uber, the way we vacation has changed. This change has been documented and open data on it is available.
DataSet Used
The data we will be using in this tutorial is New York City Airbnb Open Data from Kaggle. We will use this data to look at the relationships between types of housing and location.
Preparing Data
Download the dataset.
In this section, we will learn how to load in our dataset, view the data in our dataset, and clean it up so it’s easy for us to work with.
First, let’s load in the following libraries so we can use certain functions:
# for data wrangling
library(tidyverse)
library(dplyr)
# for data analysis
library(geosphere)
library(ggplot2)
library(broom)
Loading Data
CSV files are files that include data which are “comma-separated values”, meaning that data values are literally separated by commas.
After we’ve downloaded our CSV file from Kaggle into our working directory, we can use the read_csv function to load the CSV file data into our program’s data frame, which is a table of the data.
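The loading step might look like the sketch below. So that it runs anywhere, it writes a tiny stand-in CSV to a temporary file first. The filename `AB_NYC_2019.csv` in the comment and the name `sample_tab` are assumptions for illustration; in this tutorial the real result is stored in `airbnb_tab`.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file so the example is self-contained.
# With the real download you would pass its path instead, e.g. "AB_NYC_2019.csv"
# (an assumed filename; check what your Kaggle download is actually called).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,neighbourhood_group,room_type,price",
  "2539,Brooklyn,Private room,149",
  "2595,Manhattan,Entire home/apt,225"
), csv_path)

# read_csv() parses the file into a data frame (a tibble)
sample_tab <- read_csv(csv_path)
print(sample_tab)
```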
There are some attributes that we don’t need for our purposes, like host_id, host_name, minimum_nights, number_of_reviews, last_review, reviews_per_month, and calculated_host_listings_count. So, let’s remove these from our data frame:
# a vector called "to_remove" that has the names of the attributes we don't want
to_remove <- c('host_id',
'host_name',
'minimum_nights',
'number_of_reviews',
'last_review',
'reviews_per_month',
'calculated_host_listings_count')
# removing attributes from data frame using "to_remove"
airbnb_tab = airbnb_tab[ , !(names(airbnb_tab) %in% to_remove)]
Viewing Data
Here, we see the first 10 rows in our dataset:
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 |
|---|---|---|---|---|---|---|---|---|
| 2539 | Clean & quiet apt home by the park | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 365 |
| 2595 | Skylit Midtown Castle | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 355 |
| 3647 | THE VILLAGE OF HARLEM….NEW YORK ! | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 365 |
| 3831 | Cozy Entire Floor of Brownstone | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 194 |
| 5022 | Entire Apt: Spacious Studio/Loft by central park | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 0 |
| 5099 | Large Cozy 1 BR Apartment In Midtown East | Manhattan | Murray Hill | 40.74767 | -73.97500 | Entire home/apt | 200 | 129 |
| 5121 | BlissArtsSpace! | Brooklyn | Bedford-Stuyvesant | 40.68688 | -73.95596 | Private room | 60 | 0 |
| 5178 | Large Furnished Room Near B’way | Manhattan | Hell’s Kitchen | 40.76489 | -73.98493 | Private room | 79 | 220 |
| 5203 | Cozy Clean Guest Room - Family Apt | Manhattan | Upper West Side | 40.80178 | -73.96723 | Private room | 79 | 0 |
| 5238 | Cute & Cozy Lower East Side 1 bdrm | Manhattan | Chinatown | 40.71344 | -73.99037 | Entire home/apt | 150 | 188 |
Some Notes:
knitr::kable() is used to make the table “pretty” and easier to read
head(df, n=10) is used to view the dataframe with a specific number of rows (head() is not always necessary, you can just list the data frame for it to render)
df is where the dataframe goes, in this case airbnb_tab
n = determines the number of rows visible, in this case 10
The following is a list of descriptions for the attributes of our data set:
| Attribute | Description/Unit |
|---|---|
| id | Unique ID for each Airbnb listing |
| name | Name or description of the Airbnb listing |
| neighbourhood_group | Boroughs of New York (Manhattan, Brooklyn, Queens, Bronx, Staten Island) |
| neighbourhood | Neighborhoods of New York |
| latitude | Degrees of latitude, measures distance North and South from Equator |
| longitude | Degrees of longitude, measures distance East and West of Prime Meridian |
| room_type | Type of space offered (Entire home/apt, Private room, Shared room) |
| price | Price of listing, in US Dollars |
| availability_365 | Number of days in a year when the listing is available for booking |
Tidying Data
A tidy dataset has the elements listed below.
Elements of a tidy dataset:
1. Each observation/entity forms a row
2. Each variable/attribute forms a column
3. Each observational unit (type of entity) forms a table (i.e. not dependent on one another)
Our dataset is already tidy and meets the criteria above. Each entity is a row and each attribute is a column, where no entity is dependent on another.
However, if your data set is untidy, below is an example on a different small dataset, to show you what to do.
Sample Tidying
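Here is one possible sketch of tidying a small untidy table with `pivot_longer()` from the tidyr package. The `untidy` data frame and its numbers are made up for illustration; they are not drawn from the Airbnb data.

```r
library(tidyr)

# Untidy toy table (made up): one column per year, so the variable "year"
# is hidden in the column names instead of having its own column
untidy <- data.frame(
  borough = c("Brooklyn", "Manhattan"),
  `2018`  = c(120, 190),
  `2019`  = c(125, 200),
  check.names = FALSE
)

# pivot_longer() moves the year columns into a year/avg_price pair of columns,
# giving one row per (borough, year) observation
tidy <- pivot_longer(untidy,
                     cols      = c(`2018`, `2019`),
                     names_to  = "year",
                     values_to = "avg_price")
print(tidy)
```

After the pivot, each row is one observation and each column is one variable, which matches the first two elements of a tidy dataset.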
Exploratory Data Analysis
In this section, we begin exploring what our data can tell us using visualizations. This will help us to better understand our data and help us make decisions about how we may want to further manipulate the data to see something specific, or decide which methods are best for modelling and Machine Learning!
The main reason for exploratory data analysis, or EDA, is to help us find any problems in our data preparation and gain a sense of variable properties, such as central trends (mean), spread (variance), skew, outliers, and relationships between pairs of variables, like their correlation or covariance.
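As a rough illustration of those properties, the base R sketch below computes them on two short vectors. The numbers are made up; on the real data you would use columns such as `price` and `availability_365`.

```r
# Toy prices and availability (made up for illustration)
price        <- c(60, 79, 80, 89, 149, 150, 200, 225, 1200)
availability <- c(10, 220, 30, 194, 365, 365, 129, 355, 300)

mean(price)      # central trend
median(price)    # central trend that is less sensitive to outliers
var(price)       # spread
quantile(price)  # quartiles, useful for spotting skew

# Flag outliers with the 1.5 * IQR rule that box plots commonly use
upper_fence <- unname(quantile(price, 0.75)) + 1.5 * IQR(price)
outliers <- price[price > upper_fence]
outliers

# Relationship between a pair of variables
cor(price, availability)
cov(price, availability)
```

A mean well above the median, as in this toy vector, is a quick hint that the distribution is skewed to the right.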
You can read more about EDA at CMSC 320 EDA Lecture Notes by Professor Hector Corrada Bravo.
Handling Missing Data
Recall that the attribute availability_365 tells us how many days in the year that this particular listing is available for people to book.
Notice that 0 is a value for some of the entities (Airbnb listings). It doesn’t make much sense for us to look at entities that aren’t available at all during the year. In fact, more than 17,000 entities are listed as being available for 0 days out of the year! That’s about 1/3 of our dataset.
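A count and share like the ones quoted above can be computed in a couple of lines. This is only a sketch on a made-up vector; with the real data you would use `airbnb_tab$availability_365` instead.

```r
# Toy availability values (made up); the tutorial's column is availability_365
availability_365 <- c(365, 355, 0, 194, 0, 129)

n_zero     <- sum(availability_365 == 0)   # number of listings never available
share_zero <- mean(availability_365 == 0)  # fraction of listings never available

n_zero
share_zero
```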
We’ll call this “missing data”, and remove these entities from our dataset:
airbnb_tab <- airbnb_tab %>%
filter(availability_365 > 0) # filter() is used to filter the dataframe via specific conditions
knitr::kable(head(airbnb_tab, n=10))
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 |
|---|---|---|---|---|---|---|---|---|
| 2539 | Clean & quiet apt home by the park | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 365 |
| 2595 | Skylit Midtown Castle | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 355 |
| 3647 | THE VILLAGE OF HARLEM….NEW YORK ! | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 365 |
| 3831 | Cozy Entire Floor of Brownstone | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 194 |
| 5099 | Large Cozy 1 BR Apartment In Midtown East | Manhattan | Murray Hill | 40.74767 | -73.97500 | Entire home/apt | 200 | 129 |
| 5178 | Large Furnished Room Near B’way | Manhattan | Hell’s Kitchen | 40.76489 | -73.98493 | Private room | 79 | 220 |
| 5238 | Cute & Cozy Lower East Side 1 bdrm | Manhattan | Chinatown | 40.71344 | -73.99037 | Entire home/apt | 150 | 188 |
| 5295 | Beautiful 1br on Upper West Side | Manhattan | Upper West Side | 40.80316 | -73.96545 | Entire home/apt | 135 | 6 |
| 5441 | Central Manhattan/near Broadway | Manhattan | Hell’s Kitchen | 40.76076 | -73.98867 | Private room | 85 | 39 |
| 5803 | Lovely Room 1, Garden, Best Area, Legal rental | Brooklyn | South Slope | 40.66829 | -73.98779 | Private room | 89 | 314 |
Note that one way to handle missing data, as mentioned in the data science pipeline section (data cleaning), is removing it altogether. Having 0 as a value for availability_365 is a form of missing data.
Visualizations
Interactive Map
library(leaflet)
# Creating NYC Map
nyc_map <- leaflet(airbnb_tab) %>%
  addTiles() %>%
  setView(
    lat = 40.730610,
    lng = -73.935242,
    zoom = 11)
nyc_map
leaflet(airbnb_tab) %>%
  addTiles() %>%
  addAwesomeMarkers(
    lng = ~longitude,
    lat = ~latitude,
    icon = awesomeIcons(
      icon = 'ios-close',
      iconColor = 'black',
      library = 'ion',
      # marker color encodes the room type
      markerColor = ~ifelse(room_type == 'Entire home/apt', "green",
                     ifelse(room_type == 'Private room', "orange",
                            "red"))
    ),
    ## Price Label
    label = ~as.character(price),
    ## Clustering to show listing density
    clusterOptions = markerClusterOptions()
  ) %>%
  addLegend(
    position = 'bottomright',
    colors = c("green", "orange", "red"),
    labels = c("Entire Home/Apt", "Private Room", "Shared Room"),
    title = 'Types of Rentals'
  )
Histograms
library(ggplot2)
library(ggthemes)
airbnb_home <- airbnb_tab %>%
filter(room_type == 'Entire home/apt')
knitr::kable(head(airbnb_home, n=10))
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 |
|---|---|---|---|---|---|---|---|---|
| 2595 | Skylit Midtown Castle | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 355 |
| 3831 | Cozy Entire Floor of Brownstone | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 194 |
| 5099 | Large Cozy 1 BR Apartment In Midtown East | Manhattan | Murray Hill | 40.74767 | -73.97500 | Entire home/apt | 200 | 129 |
| 5238 | Cute & Cozy Lower East Side 1 bdrm | Manhattan | Chinatown | 40.71344 | -73.99037 | Entire home/apt | 150 | 188 |
| 5295 | Beautiful 1br on Upper West Side | Manhattan | Upper West Side | 40.80316 | -73.96545 | Entire home/apt | 135 | 6 |
| 6848 | Only 2 stops to Manhattan studio | Brooklyn | Williamsburg | 40.70837 | -73.95352 | Entire home/apt | 140 | 46 |
| 7097 | Perfect for Your Parents + Garden | Brooklyn | Fort Greene | 40.69169 | -73.97185 | Entire home/apt | 215 | 321 |
| 7726 | Hip Historic Brownstone Apartment with Backyard | Brooklyn | Crown Heights | 40.67592 | -73.94694 | Entire home/apt | 99 | 21 |
| 7750 | Huge 2 BR Upper East Cental Park | Manhattan | East Harlem | 40.79685 | -73.94872 | Entire home/apt | 190 | 249 |
| 8490 | MAISON DES SIRENES1,bohemian apartment | Brooklyn | Bedford-Stuyvesant | 40.68371 | -73.94028 | Entire home/apt | 120 | 233 |
airbnb_home %>%
ggplot(aes(x = neighbourhood_group, y = price)) +
geom_boxplot()+
coord_flip() +
theme_economist() +
scale_fill_economist() +
labs(title = "Entire Homes & Apts. Price By Neighborhood in 2019",
x = "Major Neighborhood Groups",
y = "Price(USD)")
airbnb_home %>%
ggplot(aes(x = neighbourhood_group, y = price)) +
geom_boxplot()+
scale_y_continuous(limits = c(0, 1500)) +
coord_flip() +
theme_economist() +
scale_fill_economist() +
labs(title = "2019 NYC Homes & Apts. Prices (Up to $1500/night)",
x = "Major Neighborhood Groups",
y = "Price(USD)")
airbnb_room <- airbnb_tab %>%
filter(room_type == 'Private room')
knitr::kable(head(airbnb_room, n=10))
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 |
|---|---|---|---|---|---|---|---|---|
| 2539 | Clean & quiet apt home by the park | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 365 |
| 3647 | THE VILLAGE OF HARLEM….NEW YORK ! | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 365 |
| 5178 | Large Furnished Room Near B’way | Manhattan | Hell’s Kitchen | 40.76489 | -73.98493 | Private room | 79 | 220 |
| 5441 | Central Manhattan/near Broadway | Manhattan | Hell’s Kitchen | 40.76076 | -73.98867 | Private room | 85 | 39 |
| 5803 | Lovely Room 1, Garden, Best Area, Legal rental | Brooklyn | South Slope | 40.66829 | -73.98779 | Private room | 89 | 314 |
| 6021 | Wonderful Guest Bedroom in Manhattan for SINGLES | Manhattan | Upper West Side | 40.79826 | -73.96113 | Private room | 85 | 333 |
| 7322 | Chelsea Perfect | Manhattan | Chelsea | 40.74192 | -73.99501 | Private room | 140 | 12 |
| 8024 | CBG CtyBGd HelpsHaiti rm#1:1-4 | Brooklyn | Park Slope | 40.68069 | -73.97706 | Private room | 130 | 347 |
| 8025 | CBG Helps Haiti Room#2.5 | Brooklyn | Park Slope | 40.67989 | -73.97798 | Private room | 80 | 364 |
| 8110 | CBG Helps Haiti Rm #2 | Brooklyn | Park Slope | 40.68001 | -73.97865 | Private room | 110 | 304 |
airbnb_room %>%
ggplot(aes(x = neighbourhood_group, y = price)) +
geom_boxplot()+
coord_flip() +
theme_economist() +
scale_fill_economist() +
labs(title = "Private Room Price By Neighborhood in 2019",
x = "Major Neighborhood Groups",
y = "Price(USD)")
airbnb_room %>%
ggplot(aes(x = neighbourhood_group, y = price)) +
geom_boxplot()+
scale_y_continuous(limits = c(0, 500)) +
coord_flip() +
theme_economist() +
scale_fill_economist() +
labs(title = "2019 NYC Private Room Prices (Up to $500/night)",
x = "Major Neighborhood Groups",
y = "Price(USD)")
airbnb_sroom <- airbnb_tab %>%
filter(room_type == 'Shared room')
knitr::kable(head(airbnb_sroom, n=10))
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 |
|---|---|---|---|---|---|---|---|---|
| 12048 | LowerEastSide apt share shortterm 1 | Manhattan | Lower East Side | 40.71401 | -73.98917 | Shared room | 40 | 188 |
| 54453 | MIDTOWN WEST - Large alcove studio | Manhattan | Hell’s Kitchen | 40.76548 | -73.98474 | Shared room | 105 | 363 |
| 173072 | Cozy Pre-War Harlem Apartment | Manhattan | Harlem | 40.80827 | -73.95329 | Shared room | 49 | 248 |
| 391948 | Single Room | Queens | Ozone Park | 40.68581 | -73.84642 | Shared room | 45 | 364 |
| 467634 | yahmanscrashpads | Queens | Jamaica | 40.67747 | -73.76493 | Shared room | 39 | 353 |
| 564751 | Artist space for creative nomads. | Manhattan | Upper West Side | 40.80165 | -73.96287 | Shared room | 76 | 324 |
| 737126 | Williamsburg Loft!! Bedford L 1blk! | Brooklyn | Williamsburg | 40.71714 | -73.95447 | Shared room | 195 | 364 |
| 765203 | Art Lover’s Abode Brooklyn | Brooklyn | Williamsburg | 40.70745 | -73.94307 | Shared room | 52 | 88 |
| 773497 | Great spot in Brooklyn | Brooklyn | Bedford-Stuyvesant | 40.69407 | -73.94551 | Shared room | 200 | 365 |
| 819206 | Cute shared studio apartment | Manhattan | East Harlem | 40.79106 | -73.95058 | Shared room | 45 | 313 |
airbnb_sroom %>%
ggplot(aes(x = neighbourhood_group, y = price)) +
geom_boxplot()+
coord_flip() +
theme_economist() +
scale_fill_economist() +
labs(title = "Shared Room Price By Neighborhood in 2019",
x = "Major Neighborhood Groups",
y = "Price(USD)")
airbnb_sroom %>%
ggplot(aes(x = neighbourhood_group, y = price)) +
geom_boxplot()+
scale_y_continuous(limits = c(0, 200)) +
coord_flip() +
theme_economist() +
scale_fill_economist() +
labs(title = "2019 NYC Shared Room Prices (Up to $200/night)",
x = "Major Neighborhood Groups",
y = "Price(USD)")
Hypothesis Testing & Machine Learning
Linear Regression
With datasets that are large, it can be very useful to generate a linear regression, or a line of “best fit”, for an easier interpretation of the data. This data analysis technique is also an effective way to learn about general trends of our data set and lets us construct confidence intervals and do hypothesis testing, which analyzes and tests for relationships between variables.
We want to look at the relationship between price and distance from Times Square in New York City, the most populous city in the United States. We are looking at Times Square since it is a major commercial intersection, tourist destination, entertainment center, and neighborhood in the Midtown Manhattan section of NYC (Wikipedia).
For these reasons, we would like to see if Airbnb listing prices increase as their distance to Times Square (latitude 40.757, longitude -73.986) decreases, and vice versa. We will be using functions from the geosphere library to calculate the distance between coordinates.
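If you are curious what a haversine distance calculation does, the sketch below implements the formula in base R. The function name `haversine_m` is hypothetical, and the Earth radius of 6378137 m is assumed here to match the default of `geosphere::distHaversine()`. The listing coordinates are those of the first row of the table shown earlier.

```r
# Haversine great-circle distance in meters between two (longitude, latitude)
# points given in degrees. The radius 6378137 m is an assumption chosen to
# match the default radius of geosphere::distHaversine().
haversine_m <- function(p1, p2, r = 6378137) {
  to_rad <- pi / 180
  lon1 <- p1[1] * to_rad; lat1 <- p1[2] * to_rad
  lon2 <- p2[1] * to_rad; lat2 <- p2[2] * to_rad
  a <- sin((lat2 - lat1) / 2)^2 +
       cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)^2
  2 * r * asin(sqrt(a))
}

times_square <- c(-73.986, 40.757)      # (longitude, latitude) from the text
kensington   <- c(-73.97237, 40.64749)  # listing 2539 from the first table

dist_miles <- haversine_m(kensington, times_square) / 1609  # meters to miles
dist_miles
```

For this listing the result should come out at roughly 7.6 miles. Note that points are given as (longitude, latitude), the same order geosphere expects.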
First, let’s add an attribute called distToTimesSquare in our dataset. This will contain the distance (in miles) between each listing and Times Square.
coordsTimeSquare <- c(-73.986, 40.757) # (longitude, latitude) of Times Square
airbnb_tab <- airbnb_tab %>%
  # distHaversine() accepts a two-column (longitude, latitude) matrix, so no row-by-row loop is needed
  mutate(distToTimesSquare =
    distHaversine(cbind(longitude, latitude), coordsTimeSquare) / 1609) # divide by 1609 to convert meters to miles
knitr::kable(head(airbnb_tab))
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 | distToTimesSquare |
|---|---|---|---|---|---|---|---|---|---|
| 2539 | Clean & quiet apt home by the park | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 365 | 7.6101584 |
| 2595 | Skylit Midtown Castle | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 355 | 0.2614253 |
| 3647 | THE VILLAGE OF HARLEM….NEW YORK ! | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 365 | 4.2767099 |
| 3831 | Cozy Entire Floor of Brownstone | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 194 | 5.1585482 |
| 5099 | Large Cozy 1 BR Apartment In Midtown East | Manhattan | Murray Hill | 40.74767 | -73.97500 | Entire home/apt | 200 | 129 | 0.8654731 |
| 5178 | Large Furnished Room Near B’way | Manhattan | Hell’s Kitchen | 40.76489 | -73.98493 | Private room | 79 | 220 | 0.5487460 |
Second, let’s split our current airbnb_tab data frame into three data frames, one each for room_type == "Entire home/apt", room_type == "Private room", and room_type == "Shared room". This is because prices are much higher for “Entire home/apt” listings, so we don’t want room type to confuse the picture when regressing against distance. We only want to see the relation between distance and prices, not between prices and size of the space being listed!
# create new dataframe of listings where room_type=="Entire home/apt"
entire_tab <- airbnb_tab %>%
filter(room_type == "Entire home/apt")
# create new dataframe of listings where room_type=="Private room"
private_tab <- airbnb_tab %>%
filter(room_type == "Private room")
# create new dataframe of listings where room_type=="Shared room"
shared_tab <- airbnb_tab %>%
filter(room_type == "Shared room")
knitr::kable(head(entire_tab))
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 | distToTimesSquare |
|---|---|---|---|---|---|---|---|---|---|
| 2595 | Skylit Midtown Castle | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 355 | 0.2614253 |
| 3831 | Cozy Entire Floor of Brownstone | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 194 | 5.1585482 |
| 5099 | Large Cozy 1 BR Apartment In Midtown East | Manhattan | Murray Hill | 40.74767 | -73.97500 | Entire home/apt | 200 | 129 | 0.8654731 |
| 5238 | Cute & Cozy Lower East Side 1 bdrm | Manhattan | Chinatown | 40.71344 | -73.99037 | Entire home/apt | 150 | 188 | 3.0224159 |
| 5295 | Beautiful 1br on Upper West Side | Manhattan | Upper West Side | 40.80316 | -73.96545 | Entire home/apt | 135 | 6 | 3.3701851 |
| 6848 | Only 2 stops to Manhattan studio | Brooklyn | Williamsburg | 40.70837 | -73.95352 | Entire home/apt | 140 | 46 | 3.7708536 |
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 | distToTimesSquare |
|---|---|---|---|---|---|---|---|---|---|
| 2539 | Clean & quiet apt home by the park | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 365 | 7.610158 |
| 3647 | THE VILLAGE OF HARLEM….NEW YORK ! | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 365 | 4.276710 |
| 5178 | Large Furnished Room Near B’way | Manhattan | Hell’s Kitchen | 40.76489 | -73.98493 | Private room | 79 | 220 | 0.548746 |
| 5441 | Central Manhattan/near Broadway | Manhattan | Hell’s Kitchen | 40.76076 | -73.98867 | Private room | 85 | 39 | 0.295381 |
| 5803 | Lovely Room 1, Garden, Best Area, Legal rental | Brooklyn | South Slope | 40.66829 | -73.98779 | Private room | 89 | 314 | 6.138165 |
| 6021 | Wonderful Guest Bedroom in Manhattan for SINGLES | Manhattan | Upper West Side | 40.79826 | -73.96113 | Private room | 85 | 333 | 3.137898 |
| id | name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | availability_365 | distToTimesSquare |
|---|---|---|---|---|---|---|---|---|---|
| 12048 | LowerEastSide apt share shortterm 1 | Manhattan | Lower East Side | 40.71401 | -73.98917 | Shared room | 40 | 188 | 2.978924 |
| 54453 | MIDTOWN WEST - Large alcove studio | Manhattan | Hell’s Kitchen | 40.76548 | -73.98474 | Shared room | 105 | 363 | 0.590397 |
| 173072 | Cozy Pre-War Harlem Apartment | Manhattan | Harlem | 40.80827 | -73.95329 | Shared room | 49 | 248 | 3.939358 |
| 391948 | Single Room | Queens | Ozone Park | 40.68581 | -73.84642 | Shared room | 45 | 364 | 8.821836 |
| 467634 | yahmanscrashpads | Queens | Jamaica | 40.67747 | -73.76493 | Shared room | 39 | 353 | 12.832089 |
| 564751 | Artist space for creative nomads. | Manhattan | Upper West Side | 40.80165 | -73.96287 | Shared room | 76 | 324 | 3.318301 |
Third, we want to create a scatter plot of the prices of listings against their distance to Times Square. We’ll also add a regression line to this scatter plot to show the general increasing or decreasing trend in our data! Let’s do this three times, once for each room_type we are interested in.
entire_tab %>%
  ggplot(aes(x = distToTimesSquare, y = price)) + # refer to columns by name inside aes()
  geom_point() + # plot points for scatter plot
  geom_smooth(method = lm) + # plot linear regression line or line of best fit
  ylim(0, 1500) + # set the upper limit of prices to $1500
  labs(title = "Homes & Apts. Prices vs Distance to Times Square", x = "Distance to Times Square (miles)", y = "Price (USD)")
private_tab %>%
  ggplot(aes(x = distToTimesSquare, y = price)) +
  geom_point() + # plot points for scatter plot
  geom_smooth(method = lm) + # plot linear regression line or line of best fit
  ylim(0, 500) + # set the upper limit of prices to $500
  labs(title = "Private Room Prices vs Distance to Times Square", x = "Distance to Times Square (miles)", y = "Price (USD)")
shared_tab %>%
  ggplot(aes(x = distToTimesSquare, y = price)) +
  geom_point() + # plot points for scatter plot
  geom_smooth(method = lm) + # plot linear regression line or line of best fit
  ylim(0, 200) + # set the upper limit of prices to $200
  labs(title = "Shared Room Prices vs Distance to Times Square", x = "Distance to Times Square (miles)", y = "Price (USD)")
Lastly, let’s analyze the resulting models quantitatively using broom::tidy.
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.43 0.0284 156. 0.
## 2 price -0.00149 0.0000760 -19.6 6.45e-85
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.44 0.0277 197. 0.
## 2 price -0.00237 0.000141 -16.8 7.26e-63
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.41 0.128 42.3 2.95e-212
## 2 price -0.00555 0.00109 -5.11 3.93e- 7
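The code that produced these tables is not shown. Since the term column lists price, a formula of the form `distToTimesSquare ~ price` is a reasonable guess. Under that assumption, a minimal sketch on made-up data could look like this (the `toy_tab` values are invented, so its numbers will not match the output above):

```r
library(broom)

# Toy listings (made up): distance in miles and nightly price in USD
toy_tab <- data.frame(
  distToTimesSquare = c(0.3, 0.9, 3.0, 3.4, 5.2, 7.6),
  price             = c(225, 200, 150, 135, 89, 60)
)

# Fit distance as a linear function of price (assumed formula, see above)
fit <- lm(distToTimesSquare ~ price, data = toy_tab)

# One row per coefficient: estimate, std.error, statistic, p.value
fit_stats <- tidy(fit)
print(fit_stats)

# 95% confidence intervals for the intercept and the slope
confint(fit)
```

To run it on the tutorial's data, you would swap `toy_tab` for `entire_tab`, `private_tab`, or `shared_tab`.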
As we can see in all three of these linear regression plots, the prices of all the types of listing decrease slowly as the location of the listing gets further away from Times Square. Note that the term column in the model output is price, so these models treat price as the predictor and distance as the response: on average, a listing is 0.00149 miles (homes and apts), 0.00237 miles (private rooms), and 0.00555 miles (shared rooms) closer to Times Square for each additional dollar of price.
Even though we can clearly see a trend in our linear regressions, it is best to conduct hypothesis testing in order to determine whether there is a statistically significant relationship between Airbnb prices and their distance away from high traffic locations, such as Times Square in New York City (Statistics How To).
Let’s ask the question: Do we reject the null hypothesis of no relationship between price and distance from Times Square?
Our answer: Yes, we reject the null hypothesis since the p-values for all three linear regressions are far smaller than 0.05. A p-value at or below 0.05 means that, if there were truly no relationship, we would expect to see an association at least this strong less than 5% of the time, so our results are unlikely to have happened by chance alone (Statistics How To).
You can read more about Linear Regression at CMSC 320 Linear Regression Lecture Notes by Professor Hector Corrada Bravo.